import httpimport
url='https://raw.githubusercontent.com/ProfKauf/Modules/main/'
with httpimport.remote_repo(url):
import profK_libraries, profK_statistics
from profK_libraries import *
from profK_statistics import *
#data
data= pd.read_excel ('C:\\Users\\kauffeldt\\Dropbox\\Teaching\\3_Programme\\Data\\WithCodebooks\\ResearchData\\YoungPeopleSurvey\\YoungPeopleSurvey.xlsx')
data.head()
| Music | Slow songs or fast songs | Dance | Folk | Country | Classical music | Musical | Pop | Rock | Metal or Hardrock | ... | Age | Height | Weight | Number of siblings | Gender | Left - right handed | Education | Only child | Village - town | House - block of flats | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 5.0 | 5.0 | 1.0 | ... | 20.0 | 163.0 | 48.0 | 1.0 | female | right handed | college/bachelor degree | no | village | block of flats |
| 1 | 4.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 | 4.0 | ... | 19.0 | 163.0 | 58.0 | 2.0 | female | right handed | college/bachelor degree | no | city | block of flats |
| 2 | 5.0 | 5.0 | 2.0 | 2.0 | 3.0 | 4.0 | 5.0 | 3.0 | 5.0 | 3.0 | ... | 20.0 | 176.0 | 67.0 | 2.0 | female | right handed | secondary school | no | city | block of flats |
| 3 | 5.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | ... | 22.0 | 172.0 | 59.0 | 1.0 | female | right handed | college/bachelor degree | yes | city | house/bungalow |
| 4 | 5.0 | 3.0 | 4.0 | 3.0 | 2.0 | 4.0 | 3.0 | 5.0 | 3.0 | 1.0 | ... | 20.0 | 170.0 | 59.0 | 1.0 | female | right handed | secondary school | no | village | house/bungalow |
5 rows × 150 columns
If the level of measurement of the target variable is at least quantitative:
Distribution Plots:
plots.dist(data['Weight'],fig=[6,4],labelsize=14,ticksize=10,legsize=10,linewidth=1.5)
plots.dist(data=data,var='Weight',groupvar='Gender',fig=[6,4],labelsize=14,ticksize=10,legsize=10,linewidth=1.5)
Bar Plots:
plt.figure(figsize=(4,4))
sns.barplot(y='Height',data=data,ci='sd',capsize=.1,errwidth=1.5)
<Axes: ylabel='Height'>
plt.figure(figsize=(4,4))
sns.barplot(x='Gender',y='Height',data=data,ci='sd',capsize=.1,errwidth=1.5)
<Axes: xlabel='Gender', ylabel='Height'>
If the level of measurement of the target variable is at least ordinal:
Box Plot:
sns.boxplot(x='Education',y='Age',data=data)
plt.xticks(rotation=30)
(array([0, 1, 2, 3, 4, 5]), [Text(0, 0, 'college/bachelor degree'), Text(1, 0, 'secondary school'), Text(2, 0, 'primary school'), Text(3, 0, 'masters degree'), Text(4, 0, 'doctorate degree'), Text(5, 0, 'currently a primary school pupil')])
Cat Plot:
sns.catplot(x='Education',y='Age',data=data)
plt.xticks(rotation=30)
([0, 1, 2, 3, 4, 5, 6], [Text(0, 0, 'college/bachelor degree'), Text(1, 0, 'secondary school'), Text(2, 0, 'primary school'), Text(3, 0, 'masters degree'), Text(4, 0, 'doctorate degree'), Text(5, 0, 'currently a primary school pupil'), Text(6, 0, 'nan')])
Violin Plot:
sns.catplot(data=data, y="Education", x="Age", kind="violin", color=".9", inner=None)
sns.swarmplot(data=data, y="Education", x="Age", size=3)
<Axes: xlabel='Age', ylabel='Education'>
sns.catplot(data=data, y="Education", x="Age", hue='Gender', kind="violin",inner=None, split=True)
<seaborn.axisgrid.FacetGrid at 0x17c21959c10>
If the level of measurement of the target variable is at least nominal:
sns.countplot(x=data['Gender'])
<Axes: xlabel='Gender', ylabel='count'>
sns.countplot(x=data['Education'],hue=data['Gender'])
plt.xticks(rotation=30)
(array([0, 1, 2, 3, 4, 5]), [Text(0, 0, 'college/bachelor degree'), Text(1, 0, 'secondary school'), Text(2, 0, 'primary school'), Text(3, 0, 'masters degree'), Text(4, 0, 'doctorate degree'), Text(5, 0, 'currently a primary school pupil')])
data_des=data[['Gender','Movies','Age']]
data_des=data_des.dropna()
des=describe.data(data_des,ordinal=['Movies'],nominal=['Gender'])
des.table()
| Age | |
|---|---|
| count | 992.000000 |
| mean | 20.423387 |
| std | 2.808409 |
| min | 15.000000 |
| 25% | 19.000000 |
| 50% | 20.000000 |
| 75% | 22.000000 |
| max | 30.000000 |
des.table(show='ordinal')
| Movies | |
|---|---|
| count | 992.0 |
| categories | 5.0 |
| iqr | 1.0 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |
des.table(show='nominal')
| Gender | |
|---|---|
| count | 992 |
| mode | female |
| categories | 2 |
| least freq | male(40.93%) |
| most freq | female(59.07%) |
Types of Tests
Steps when running a test
We will explain these steps with the help of a Two-sample t-Test on mean difference.
Step 1: Research Question: Are men on average taller than women?
Hypotheses: $H0: \text{mean height men} \le \text{mean height women}$ and $HA: \text{mean height men} > \text{mean height women}$
Slice Data
Keep only those columns you need.
data_ttest=data[['Height','Gender']]
data_ttest.head(2)
| Height | Gender | |
|---|---|---|
| 0 | 163.0 | female |
| 1 | 163.0 | female |
Remove Nan
NaN = not a number = missing values (must be removed)
nan=dataprep.nan(data_ttest)
analysis:
nan.analysis
| Column | Missing Values | ||
|---|---|---|---|
| Analysis Missing Values | Height | 20 | |
| Gender | 6 | ||
| Number of Rows with NaNs | 25 |
drop rows with missing values:
data_ttest2=nan.drop
data_ttest.shape
(1010, 2)
data_ttest2.shape
(985, 2)
Encoding
Not necessary -> gender is the grouping variable and height is quantitative.
Graphical (here Barplot because height is quantitative)
plt.figure(figsize=(4,4))
sns.barplot(x='Gender',y='Height',data=data_ttest2,ci='sd',capsize=.1,errwidth=1.5)
<Axes: xlabel='Gender', ylabel='Height'>
Groupwise Descriptive Statistics
data_ttest2.groupby('Gender').describe().round(3)
| Height | ||||||||
|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | |
| Gender | ||||||||
| female | 580.0 | 167.771 | 7.520 | 62.0 | 164.0 | 168.0 | 172.0 | 186.0 |
| male | 405.0 | 181.758 | 6.965 | 159.0 | 178.0 | 182.0 | 186.0 | 203.0 |
Assumptions t-Test
T1. Dependent variable is quantitative
Height is a quantitative variable, so this is true.
T2. Populations follow a normal distribution
plots.dist(data=data_ttest2,var='Height',groupvar='Gender',fig=[6,4],labelsize=14,ticksize=10,legsize=10,linewidth=1.5)
plots.qq(data=data_ttest2,var='Height',groupvar='Gender',fig=[6,4],labelsize=14,ticksize=10,legsize=10,dotsize=40)
T3. Independent measurements
T4. No Outliers
What are outliers?
Method zscore:
An observation is a potential outlier if it is more than 3 standard deviations away from the mean.
Method interquartile range (iqr)
interquartile range = q3(quartile 3) - q1(quartile 1)
An observation is a potential outlier if it is < q1 - 1.5iqr or > q3 + 1.5 iqr
Method median absolute deviation (mad)
mad = median of the absolute value of the differences between observation and data median ( = median(|observation - M|))
An observation is a potential outlier if it is < M - 2.24mad or > M + 2.24mad
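The three rules above can be sketched as plain NumPy functions (a minimal illustration with made-up heights, not the library's `outlier.univariate` implementation):

```python
import numpy as np

def zscore_outliers(x):
    # flag points more than 3 standard deviations away from the mean
    z = (x - x.mean()) / x.std()
    return np.abs(z) > 3

def iqr_outliers(x):
    # flag points below q1 - 1.5*iqr or above q3 + 1.5*iqr
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def mad_outliers(x):
    # mad = median(|x - median(x)|);
    # flag points below M - 2.24*mad or above M + 2.24*mad
    m = np.median(x)
    mad = np.median(np.abs(x - m))
    return (x < m - 2.24 * mad) | (x > m + 2.24 * mad)

heights = np.array([168, 172, 165, 170, 175, 169, 171, 166, 174, 173, 167,
                    170, 172, 168, 171, 169, 176, 164, 170, 173, 62.0])
print(heights[zscore_outliers(heights)])   # only the 62 cm entry stands out
```

On tightly clustered data like these, the mad rule tends to flag many borderline points, which matches the pattern in the outlier analysis that follows.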
Outlier Analysis
separate the data set according to groups:
groupdata=dataprep.group_sep(data=data_ttest2,groupvar='Gender')
out_female=outlier.univariate(groupdata[0]['Height'])
out_male=outlier.univariate(groupdata[1]['Height'])
out_female.analysis
| method | pot. outlier | proportion | |
|---|---|---|---|
| extreme value | zscore | 1 | 0.17% |
| analysis | iqr | 23 | 3.97% |
| mad | 90 | 15.52% | |
| E[ND] (>3 std from mean) | 1 | 0.27% |
out_male.analysis
| method | pot. outlier | proportion | |
|---|---|---|---|
| extreme value | zscore | 3 | 0.74% |
| analysis | iqr | 25 | 6.17% |
| mad | 79 | 19.51% | |
| E[ND] (>3 std from mean) | 1 | 0.27% |
The iqr and mad methods flag far too many data points as potential outliers. This leaves us with the method 'zscore'. Let's have a look at the potential outliers:
groupdata[0].loc[out_female.show(method='zscore')]
| Height | Gender | |
|---|---|---|
| 676 | 62.0 | female |
Height of 62 cm seems to be a measurement error that needs to be removed.
data_ttest3=data_ttest2.drop(out_female.show(method='zscore'))
groupdata[1].loc[out_male.show(method='zscore')]
| Height | Gender | |
|---|---|---|
| 85 | 159.0 | male |
| 547 | 203.0 | male |
| 799 | 203.0 | male |
These heights are not necessarily measurement errors.
As a further robustness check, we run the Tietjen-Moore test to check whether it is likely that there are 3 or 2 outliers in the data.
Reference: Tietjen and Moore (1972): Some Grubbs-Type Statistics for the Detection of Outliers, Technometrics, 14, pp. 583-597
outliers_tietjen(groupdata[1]['Height'],k=3,hypo=True)
False
outliers_tietjen(groupdata[1]['Height'],k=2,hypo=True)
False
There seem to be no outliers in the male group.
T5. Independent variable is binary
True. In this data set gender is binary.
T6. Homogeneity (equal variances in groups)
We test this requirement by running a Levene's test of equal variances:
H0: The variances of all groups are equal and HA: The variances of at least two groups differ
tests.equal_var.levene(data=data_ttest3,var='Height',groupvar='Gender',rem=True)
| var | group | f | dof1 | dof2 | p-val | remark | ||
|---|---|---|---|---|---|---|---|---|
| Levenes Test | Height | Gender | Mean | 8.879683 | 1 | 982 | 0.002955 | for symmetric, moderate-tailed distributions |
| of Equal Variances | Median | 8.720196 | 1 | 982 | 0.003222 | for skewed (non-normal) distributions | ||
| Trimmed | 10.901975 | 1 | 982 | 0.000999 | for heavy-tailed distributions |
We may take the Levene's test based on the mean. Here the p-value is 0.3% < 5%. Hence, we can reject the null hypothesis: the group variances differ, so the unequal-variances (Welch) version of the t-test is the appropriate one.
Remark: degrees of freedom of Levene's Test
tests.t.two_sample(data=data_ttest3,var='Height',groupvar='Gender',alternative='less').round(4)
| var | group | mean | variances | t | dof | alternative | p-val | CI95% | cohen-d | BF10 | power | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Two-Sample | Height | female | 167.9534 | equal | -32.9292 | 982.000 | less | 0.0 | [-inf, -13.11] | 2.1331 | 9.123e+156 | 1.0 |
| t-Test | male | 181.7580 | unequal | -32.1730 | 794.406 | 0.0 | [-inf, -13.1] | 2.1331 | 6.777e+151 | 1.0 |
The p-value is 0% < 5%. Hence, we can reject the null hypothesis and have evidence that men are on average taller than women. The power is 1, so the hypothesis test is very good at detecting a false null hypothesis.
Remark: degrees of freedom of the t-test. When computing a mean, we lose one degree of freedom. Hence, with one mean per group: $dof = n_1 + n_2 - 2$.
Effect size $$Cohen's \:d = \frac{mean1 - mean2}{pooled\: standard\: deviation}$$
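A sketch of this formula in plain NumPy (the two small samples are made up; for the real height data, the t-test table above reports d ≈ 2.13):

```python
import numpy as np

def cohens_d(a, b):
    # pooled standard deviation weights each group's variance
    # by its degrees of freedom
    na, nb = len(a), len(b)
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)
    pooled_sd = np.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

men = np.array([178.0, 182.0, 186.0, 176.0, 181.0])
women = np.array([163.0, 168.0, 172.0, 164.0, 170.0])
d = cohens_d(men, women)
print(round(d, 2))
```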
If assumptions are violated, you get less robust results. Sometimes this could mean that you must use other tests. For example:
Overview:
Example: Does weight increase with height?
plots.scatter(data['Height'],data['Weight'],fig=[6,4],dotsize=25,labelsize=14,ticksize=12)
The covariance measures the linear relationship between two variables:
Unfortunately, the covariance depends on the unit of measurement:
data['Height'].cov(data['Weight'])
94.6291717912906
height_in_m=data['Height']/100
height_in_m.cov(data['Weight'])
0.9462917179129056
Therefore, we use a standardized version of the covariance: Pearson's correlation coefficient r:
where the covariance of the variables is divided by the product of their standard deviations.
The correlation coefficient can only take on values between -1 and +1, where -1 indicates a perfect negative linear relationship and +1 a perfect positive linear relationship:
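The standardization can be verified directly with made-up height/weight values: rescaling a variable changes the covariance but leaves r untouched (`np.corrcoef` and `DataFrame.corr` perform the same standardization internally):

```python
import numpy as np

height = np.array([163.0, 176.0, 172.0, 170.0, 181.0, 165.0, 190.0, 158.0])
weight = np.array([48.0, 67.0, 59.0, 59.0, 80.0, 52.0, 90.0, 45.0])

cov = np.cov(height, weight)[0, 1]
r = cov / (np.std(height, ddof=1) * np.std(weight, ddof=1))

# measuring height in meters scales the covariance by 1/100 ...
cov_m = np.cov(height / 100, weight)[0, 1]
# ... but the standardized coefficient stays the same
r_m = cov_m / (np.std(height / 100, ddof=1) * np.std(weight, ddof=1))
```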
Test the correlation
Step 1. Research Question and Hypotheses
Height and weight are positively correlated:
$H0: r\le 0$ and $HA: r>0$
Step 2. Data preprocessing
data_corr=data[['Height','Weight']] #slice the data
nan=dataprep.nan(data_corr) #remove nan
nan.analysis
| Column | Missing Values | ||
|---|---|---|---|
| Analysis Missing Values | Height | 20 | |
| Weight | 20 | ||
| Number of Rows with NaNs | 30 |
data_corr=nan.drop
Step 3. Pre-analyses
correlation matrix
cm=describe.corrmat(data_corr)
cm.table.round(3)
| Height | Weight | |
|---|---|---|
| Height | 1.000 | 0.698 |
| Weight | 0.698 | 1.000 |
cm.heatmap(fig=[6,4],nsize=35,lsize=14)
cm2=describe.corrmat(data_corr,utri=False)
cm2.table
| Height | Weight | |
|---|---|---|
| Height | ||
| Weight | 0.6977 |
Step 4. Check Requirements.
Assumptions Pearson Correlation Test
Height and weight are quantitative (PCC1 true) and we may assume that no subject submitted twice as well as no strong family connections (PCC4 true).
PCC2. Bivariate normal.
plots.dist(data_corr['Height'],fig=[6,4],labelsize=14,ticksize=10,legsize=10,linewidth=1.5)
plots.dist(data_corr['Weight'],fig=[6,4],labelsize=14,ticksize=10,legsize=10,linewidth=1.5)
-> seems to be fairly normal.
PCC3. No outlier.
out=outlier.multivariate(data_corr[['Height']],data_corr['Weight'])
out.analysis
| method | pot. outlier | proportion | |
|---|---|---|---|
| extreme value | Cook | 28 | 2.86% |
| analysis | Mahalanobis | 1 | 0.1% |
The Cook method flags too many points. Let's go with Mahalanobis $\rightarrow$ visualize the outlier.
plots.outlier(x=data_corr[['Height']],y=data_corr['Weight'],method='Mahalanobis',dtype='multivariate')
plots.outlier(x=data_corr[['Height']],y=data_corr['Weight'],method='Mahalanobis',dtype='bivariate')
This is the same outlier as before -> removal
data_corr=data_corr.drop(out.show(method='Mahalanobis'))
Step 5. Run test and interpret
tests.correlation.simple(data_corr['Height'],data_corr['Weight'],alternative='greater').round(4)
| var1 | var2 | n | r (pearson) | CI95% | alternative | p-val | BF10 | power | |
|---|---|---|---|---|---|---|---|---|---|
| pearson Test of Correlation | Height | Weight | 979 | 0.7364 | [0.71, 1.0] | greater | 0.0 | 2.392e+164 | 1.0 |
The p-value is 0% < 5%. Hence, we can reject the null hypothesis and have evidence that there is a positive correlation. The power is 1, so the hypothesis test is very good at detecting a false null hypothesis.
Effect size
Furthermore $r^2\approx 54\%$ means that the variables share 54% of their variance.
In case of ordinal variables, we cannot compute the covariance as we cannot compute a mean. Correlation coefficients are determined using rank-based approaches that order the data from lowest to highest and assign a rank to each observation depending on its position. Popular rank-based correlation coefficients are:
Spearman's $\rho$
This coefficient works just like Pearson's r with the difference that it computes the covariance and the standard deviations with respect to the ranks instead of the values of the variable:
$$r^S=\frac{cov(rank(X),rank(Y))}{std(rank(X))\cdot std(rank(Y))}$$
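A quick check of this definition (assuming SciPy is available): rank both variables, apply Pearson's formula to the ranks, and compare with `scipy.stats.spearmanr`. The ratings are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 2, 3, 4, 5, 5, 5])   # made-up ordinal ratings
y = np.array([2, 1, 3, 3, 5, 4, 5, 5])

# ties receive the average of the ranks they occupy
rx, ry = stats.rankdata(x), stats.rankdata(y)
r_s = np.cov(rx, ry)[0, 1] / (np.std(rx, ddof=1) * np.std(ry, ddof=1))
```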
Kendall's $\tau$, Goodman and Kruskal's $\gamma$
These coefficients are based on the number of concordant and discordant pairs in a data set. Given two variables X and Y, two pairs of observations $(x_i,y_i)$ and $(x_j,y_j)$ are
Relationship between Spearman and Kendall coefficient: $Kendall \approx 0.7\cdot Spearman$
Rank-based approaches identify more general monotonic relationships:
A coefficient of 1 indicates a perfectly positive monotonic relationship:
Correlation Test
Step 1. Research Question:
Does preference for classical music increase with education? $H0: r^S\le 0$ and $HA: r^S> 0$
Step 2. Data Preprocessing
data_rank=data[['Education','Classical music']] #slice data
data_rank=data_rank.dropna() #drop NaNs
Education is not encoded yet. Thus, we need to encode it:
enc=dataprep.encoder(order={'Education':['currently a primary school pupil','secondary school','primary school',
'college/bachelor degree', 'masters degree','doctorate degree']})
data_rank2=enc.fit_transform(data_rank)
data_rank2.Education.unique()
array([3., 1., 2., 4., 5., 0.])
Step 3. Pre-analyses
plots.scatter(data_rank2['Education'],data_rank2['Classical music'],fig=[6,4],ticksize=10,labelsize=12)
This is not really informative -> we need to enable the ordinal option.
plots.scatter(data_rank2['Education'],data_rank2['Classical music'],fig=[6,4],ticksize=10,labelsize=12, ordinal=True)
This rather looks like a non-monotonic relationship.
Step 4. Check assumptions
Assumptions Rank-based Correlation Tests
Education and preference for classical music are ordinal. Furthermore, we may assume independence.
Step 5. Run test and interpret.
tests.correlation.simple(data_rank2['Education'],data_rank2['Classical music'],alternative='greater',method='spearman').round(4)
| var1 | var2 | n | r (spearman) | CI95% | alternative | p-val | power | |
|---|---|---|---|---|---|---|---|---|
| spearman Test of Correlation | Education | Classical music | 1002 | 0.0276 | [-0.02, 1.0] | greater | 0.1918 | 0.2197 |
tests.correlation.simple(data_rank2['Education'],data_rank2['Classical music'],alternative='greater',method='kendall').round(4)
| var1 | var2 | n | r (kendall) | CI95% | alternative | p-val | power | |
|---|---|---|---|---|---|---|---|---|
| kendall Test of Correlation | Education | Classical music | 1002 | 0.0223 | [-0.03, 1.0] | greater | 0.2401 | 0.174 |
The p-value of both tests is above 5% (19.2% and 24.0%). Hence, we cannot reject the null hypothesis: there is no evidence that education and preference for classical music are positively correlated. Moreover, the power is only about 0.17 to 0.22, which indicates that the tests are not good at detecting a false null hypothesis.
When testing correlations, we need to take into account potential confounding variables. Say, we would like to test if age is correlated with buying organic products. Then, we also have to take into account that age is correlated with income, which might be also correlated with buying (the more expensive) organic products:
The correlation of 0.21 might be partly due to the positive correlation between age and income. Hence, we need to eliminate the effect of income. Partial correlation analysis offers a way to do this.
Partial correlation coefficient adjustment:
Let X,Y and Z be three variables. Suppose, we would like to examine the correlation between X and Y while controlling for Z.
The adjusted correlation coefficient is then:
$$r_{XY,Z}=\frac{r_{XY}-r_{XZ}\cdot r_{YZ}}{\sqrt{1-r_{XZ}^2}\cdot\sqrt{1-r_{YZ}^2}}$$
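This adjustment can be sketched in a few lines of NumPy (not the `tests.correlation.partial` implementation); it is equivalent to regressing Z out of both X and Y and correlating the residuals. The simulated data below are made up:

```python
import numpy as np

def partial_corr(x, y, z):
    # adjust the X-Y correlation for Z via the pairwise Pearson coefficients
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / (np.sqrt(1 - r_xz**2) * np.sqrt(1 - r_yz**2))

# simulated example: X and Y are only correlated through the confounder Z
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + rng.normal(size=200)
y = z + rng.normal(size=200)
r_adj = partial_corr(x, y, z)
```

After controlling for Z, the adjusted coefficient drops toward zero, as it should when the raw X-Y correlation is driven entirely by the confounder.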
data_part=data[['Age','Height','Weight']]
data_part=data_part.dropna()
data_part['Age'].corr(data_part.Weight)
0.23708368338501684
tests.correlation.partial(data=data_part,var1='Age',var2='Height',covar=['Weight'])
| var1 | var2 | covar | n | r (pearson) | CI95% | alternative | p-val | |
|---|---|---|---|---|---|---|---|---|
| pearson Partial Correlation Test | Age | Height | [Weight] | 978 | -0.072924 | [-0.14, -0.01] | two-sided | 0.022637 |
In order to test if a nominal and a nominal or ordinal variable are associated, we may use a Test of Independence: $\chi^2$ or an exact Test in case of $2\times 2$ contingency tables.
Hypotheses: H0: X and Y are independent and HA: X and Y are dependent
Example: Independence Test
Step 1. Research Question and Hypotheses
Does preference for classical music depend on gender?
H0: Gender and Preference are independent and HA: Gender and Preference are dependent
Step 2. Data preprocessing
data_ind=data[['Gender','Classical music']] #slice data
nan=dataprep.nan(data_ind) #nan
nan.analysis
| Column | Missing Values | ||
|---|---|---|---|
| Analysis Missing Values | Gender | 6 | |
| Classical music | 7 | ||
| Number of Rows with NaNs | 12 |
data_ind1=nan.drop
Step 3. Pre-analyses
Countplot:
sns.countplot(x='Gender',hue='Classical music',data=data_ind1)
<Axes: xlabel='Gender', ylabel='count'>
Contingency tables:
x,y=data_ind1['Gender'], data_ind1['Classical music']
describe.contingency(x,y,show='observed')
| Classical music | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
|---|---|---|---|---|---|
| Gender | |||||
| female | 78 | 157 | 157 | 112 | 87 |
| male | 60 | 89 | 123 | 77 | 58 |
describe.contingency(x,y,show='expected').round(2)
| Classical music | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
|---|---|---|---|---|---|
| Gender | |||||
| female | 81.72 | 145.68 | 165.81 | 111.92 | 85.87 |
| male | 56.28 | 100.32 | 114.19 | 77.08 | 59.13 |
describe.contingency(x,y,show='deviations')
| Classical music | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
|---|---|---|---|---|---|
| Gender | |||||
| female | -5.0% | 8.0% | -5.0% | 0.0% | 1.0% |
| male | 7.0% | -11.0% | 8.0% | -0.0% | -2.0% |
Step 4. Check assumptions
Assumptions Chi2 Independence Test
Assumptions Exact Independence Test
We need to use $\chi^2$ because we do not have a $2\times 2$ contingency table.
C1. Gender and classical music are both categorical -> True
C2. Sample Size = 998 -> True:
len(data_ind1)
998
C3. Expected frequencies > 5 (see contingency tables) -> True
C4. Independence (within variables) may be assumed to be true.
Step 5. Run test and interpret
Idea of the test: the test compares observed frequencies with the expected frequencies provided independence. If the difference is too large, we may reject the null hypothesis (independence).
How to compute expected frequencies? $$E_{row\:i,column\:j}=\frac{(observations\: row\: i)\times (observations\: column\: j)}{Total\: observations}$$
Example: row 1, column 1: $$E_{1,1}=\frac{(78+60)\cdot (78+157+157+112+87)}{998}\approx 81.72$$
How to compute the test statistic $\chi^2$?
for a $n\times m$ contingency table:
$$\chi^2=\frac{(observed_{1,1}-expected_{1,1})^2}{expected_{1,1}}+\dots+\frac{(observed_{n,m}-expected_{n,m})^2}{expected_{n,m}}$$
In our example:
$$\chi^2=\frac{(78-81.72)^2}{81.72}+\dots+\frac{(58-59.13)^2}{59.13}\approx 3.76$$
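The same numbers can be reproduced with SciPy (assuming it is available) by feeding the observed contingency table from above into `chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed frequencies from the Gender x Classical music table above
observed = np.array([[78, 157, 157, 112, 87],
                     [60,  89, 123,  77, 58]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof)   # 3.76 4
```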
tests.independence.chi2(x,y)
| vars | no. categories | test | chi2 | dof | p-val | cramer | power | |
|---|---|---|---|---|---|---|---|---|
| Chi2 Tests | Gender | 2 | pearson | 3.758539 | 4.0 | 0.439669 | 0.061368 | 0.301622 |
| of Independence | Classical music | 5 | cressie-read | 3.764047 | 4.0 | 0.438879 | 0.061413 | 0.302043 |
| G(log-likelihood) | 3.776706 | 4.0 | 0.437068 | 0.061516 | 0.303009 | |||
| freeman-tukey | 3.787661 | 4.0 | 0.435505 | 0.061606 | 0.303846 | |||
| mod-log-likelihood | 3.799891 | 4.0 | 0.433764 | 0.061705 | 0.304780 | |||
| neyman | 3.828268 | 4.0 | 0.429746 | 0.061935 | 0.306948 |
We focus on the first (pearson) and the third (G) row; both yield a p-value > 5%. Hence, we cannot reject the null hypothesis. However, as the power shows, the test is not really good at detecting a false null hypothesis.
Effect Size
Reference: Rea, L. M., and Parker, R. A. (1992). Designing and conducting survey research. San Francisco: Jossey-Boss.
Association: Nominal vs quantitative
For nominal vs quantitative variables, we can use the measure eta ($\eta$) or a point-biserial correlation when the nominal variable is binary. However, for nominal vs. quantitative analyses it is often advisable to apply a regression model.
Which coefficient for which levels of measurement?
General correlation matrix:
data.columns
Index(['Music', 'Slow songs or fast songs', 'Dance', 'Folk', 'Country',
'Classical music', 'Musical', 'Pop', 'Rock', 'Metal or Hardrock',
...
'Age', 'Height', 'Weight', 'Number of siblings', 'Gender',
'Left - right handed', 'Education', 'Only child', 'Village - town',
'House - block of flats'],
dtype='object', length=150)
data_gen=data[['Gender','Classical music','Age','Height','Left - right handed']]
nominal=['Gender','Left - right handed']
ordinal=['Classical music']
data_gen=data_gen.dropna()
cm=describe.corrmat(data_gen,ordinal=ordinal, nominal=nominal,show_nominal=True,utri=False)
cm.table
| Classical music (ord) | Age | Height | Gender (nom) | Left - right handed (nom) | |
|---|---|---|---|---|---|
| Classical music (ord) | |||||
| Age | 0.0415 | ||||
| Height | -0.0126 | 0.1114 | |||
| Gender (nom) | -0.0097 | 0.1302 | 0.6842 | ||
| Left - right handed (nom) | 0.0658 | 0.0313 | 0.0727 | 0.0815 |
cm.coef
| Classical music (ord) | Age | Height | Gender (nom) | Left - right handed (nom) | |
|---|---|---|---|---|---|
| Classical music (ord) | |||||
| Age | spearman | ||||
| Height | spearman | pearson | |||
| Gender (nom) | rbc | pbc | pbc | ||
| Left - right handed (nom) | rbc | pbc | pbc | cramer |
cm.heatmap(rotx=30)
Repeated tests (e.g., for pairwise comparisons) inflate the family-wise error ($\alpha$ error).
Example: pairwise comparisons between 3 groups
For each test, we set the probability for a false-positive test result ($\alpha$) to 5%. Triple testing inflates this probability to 14.3%. In general:
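In general, running $k$ independent tests each at level $\alpha$ inflates the family-wise error to $1-(1-\alpha)^k$; the 14.3% figure checks out:

```python
def family_wise_error(k, alpha=0.05):
    # probability of at least one false positive among k independent tests
    return 1 - (1 - alpha) ** k

print(round(family_wise_error(3), 3))   # 0.143
```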
We must take into account the inflated family-wise error. One strategy is to adjust the p-value. Popular methods are:
Bonferroni: $pval_{adj} = pval_{unadj}\cdot (number\: tests)$
Example:
Sidak : $pval_{adj} = 1-(1-pval_{unadj})^{(number\: tests)}$
Example:
Holm-Bonferroni:
Adjust the ith p-value according to the following rule depending on its ranking position: $$pval_{adj}=(number\:tests-rank_i+1)\cdot pval_{unadj}$$ for $i = 1,\dots, number\: of\: tests$
Example:
Benjamini Hochberg:
Multiply each p-value by the number of tests and divide it by its rank.
Example:
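The four adjustments can be sketched with NumPy (a minimal illustration, not `pingouin`'s implementation; note that Holm and Benjamini-Hochberg additionally enforce monotonicity across the ordered p-values, a step the short descriptions above leave implicit):

```python
import numpy as np

def adjust(pvals, method):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    if method == 'bonferroni':
        return np.minimum(p * m, 1.0)
    if method == 'sidak':
        return 1 - (1 - p) ** m
    order = np.argsort(p)
    ranks = np.empty(m, dtype=int)
    ranks[order] = np.arange(1, m + 1)
    if method == 'holm':
        adj = (m - ranks + 1) * p
        # a larger raw p-value may never end up with a smaller adjusted one
        adj[order] = np.maximum.accumulate(adj[order])
        return np.minimum(adj, 1.0)
    if method == 'bh':
        adj = p * m / ranks
        adj[order[::-1]] = np.minimum.accumulate(adj[order[::-1]])
        return np.minimum(adj, 1.0)
    raise ValueError(method)

pvals = [0.01, 0.04, 0.03]
print(adjust(pvals, 'bonferroni'))   # [0.03 0.12 0.09]
print(adjust(pvals, 'holm'))         # [0.03 0.06 0.06]
print(adjust(pvals, 'bh'))           # [0.03 0.04 0.04]
```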
Pairwise t-tests with Bonferroni correction (see padjust):
data_pair=data[['Age','Education']]
data_pair.pairwise_tests(dv='Age', between='Education',alternative="two-sided",
interaction=False,padjust='bonf').round(3)
| Contrast | A | B | Paired | Parametric | T | dof | alternative | p-unc | p-corr | p-adjust | BF10 | hedges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Education | college/bachelor degree | currently a primary school pupil | False | True | 9.615 | 11.109 | two-sided | 0.000 | 0.000 | bonf | 1.199e+15 | 2.114 |
| 1 | Education | college/bachelor degree | doctorate degree | False | True | -6.195 | 4.408 | two-sided | 0.002 | 0.037 | bonf | 1.647e+06 | -1.967 |
| 2 | Education | college/bachelor degree | masters degree | False | True | -13.793 | 116.290 | two-sided | 0.000 | 0.000 | bonf | 1.19e+30 | -2.010 |
| 3 | Education | college/bachelor degree | primary school | False | True | 16.507 | 217.682 | two-sided | 0.000 | 0.000 | bonf | 7.296e+39 | 1.787 |
| 4 | Education | college/bachelor degree | secondary school | False | True | 6.707 | 350.933 | two-sided | 0.000 | 0.000 | bonf | 1.919e+08 | 0.543 |
| 5 | Education | currently a primary school pupil | doctorate degree | False | True | -10.909 | 7.691 | two-sided | 0.000 | 0.000 | bonf | 1.059e+05 | -5.738 |
| 6 | Education | currently a primary school pupil | masters degree | False | True | -17.016 | 18.713 | two-sided | 0.000 | 0.000 | bonf | 4.1e+25 | -3.539 |
| 7 | Education | currently a primary school pupil | primary school | False | True | -2.085 | 11.260 | two-sided | 0.061 | 0.909 | bonf | 1.742 | -0.713 |
| 8 | Education | currently a primary school pupil | secondary school | False | True | -7.437 | 9.647 | two-sided | 0.000 | 0.000 | bonf | 1.041e+10 | -1.619 |
| 9 | Education | doctorate degree | masters degree | False | True | -0.565 | 5.799 | two-sided | 0.593 | 1.000 | bonf | 0.452 | -0.156 |
| 10 | Education | doctorate degree | primary school | False | True | 11.349 | 4.438 | two-sided | 0.000 | 0.003 | bonf | 8.53e+14 | 5.627 |
| 11 | Education | doctorate degree | secondary school | False | True | 8.001 | 4.127 | two-sided | 0.001 | 0.017 | bonf | 3.556e+11 | 2.575 |
| 12 | Education | masters degree | primary school | False | True | 24.115 | 154.000 | two-sided | 0.000 | 0.000 | bonf | 2.525e+50 | 3.843 |
| 13 | Education | masters degree | secondary school | False | True | 18.464 | 89.234 | two-sided | 0.000 | 0.000 | bonf | 2.835e+58 | 2.677 |
| 14 | Education | primary school | secondary school | False | True | -13.625 | 127.922 | two-sided | 0.000 | 0.000 | bonf | 1.337e+34 | -1.181 |
Pairwise correlation tests with Benjamini Hochberg correction (see padjust):
data_pair2=data[['Age','Height','Weight']]
pg.pairwise_corr(data_pair2, method='pearson', alternative='greater', padjust='fdr_bh').round(3)
| X | Y | method | alternative | n | r | CI95% | p-unc | p-corr | p-adjust | BF10 | power | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Age | Height | pearson | greater | 988 | 0.115 | [0.06, 1.0] | 0.0 | 0.0 | fdr_bh | 54.741 | 0.976 |
| 1 | Age | Weight | pearson | greater | 987 | 0.238 | [0.19, 1.0] | 0.0 | 0.0 | fdr_bh | 2.084e+11 | 1.000 |
| 2 | Height | Weight | pearson | greater | 980 | 0.698 | [0.67, 1.0] | 0.0 | 0.0 | fdr_bh | 1.888e+140 | 1.000 |
data_pair2=data_pair2.dropna()
cm=describe.corrmat(data=data_pair2,utri=False,stars=True)
cm=describe.corrmat(data=data_pair2)
cm.heatmap()
cm.table
| Age | Height | Weight | |
|---|---|---|---|
| Age | 1.000000 | 0.114687 | 0.237084 |
| Height | 0.114687 | 1.000000 | 0.697786 |
| Weight | 0.237084 | 0.697786 | 1.000000 |
In the social sciences, we often aim to examine whether
an independent variable (X) has an effect on a dependent variable (Y). For instance, do marketing expenses (X) increase sales (Y)?
The idea of linear regression is to fit a line to the data. The theoretical equation of such a model is:
$$Sales = \beta_0+\beta_1\cdot Marketing+\varepsilon,$$where
Finding the optimal line:
Each line leads to specific errors. We would like to find the line that minimizes the errors. That is the line that is located nearest to the data.
How do we aggregate the errors?
In case of a simple linear regression (1 dependent Y, 1 independent X), the estimated regression coefficients can be computed as follows:
In the example above:
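The closed-form estimates for the simple model are presumably the standard least-squares formulas $\hat\beta_1=\frac{cov(X,Y)}{var(X)}$ and $\hat\beta_0=\bar{y}-\hat\beta_1\cdot\bar{x}$; a sketch with made-up marketing/sales numbers (the real data file is not reproduced here):

```python
import numpy as np

marketing = np.array([500.0, 876.0, 759.0, 620.0, 410.0, 930.0])
sales = np.array([2100.0, 3400.0, 3000.0, 2500.0, 1900.0, 3600.0])

# slope: covariance over variance of the independent
b1 = np.cov(marketing, sales)[0, 1] / np.var(marketing, ddof=1)
# intercept: the fitted line passes through the point of means
b0 = sales.mean() - b1 * marketing.mean()
```

The same coefficients come out of any least-squares fitter, e.g. `np.polyfit(marketing, sales, 1)`.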
Interpreting the coefficients
We may want to add further independent variables to our model. The theoretical regression equation with k independent variables looks as follows:
Conceptually, we distinguish between target variables and control variables. Say, $X_1$ and $X_2$ are the variables of interest. We examine their influence on Y while controlling for $X_3,\dots,X_k$.
The math works similarly to the simple model but now we have to imagine a regression plane (or hyperplane).
Example:
$$Sales= \beta_0+\beta_1\cdot Marketing+\beta_2\cdot Quality+\varepsilon,$$
where Quality is measured by the average product lifespan.
#data
data2= pd.read_excel ('C:\\Users\\kauffeldt\\Dropbox\\Teaching\\3_Programme\\Data\\Small\\Marketing.xlsx')
#Separate into DV and IVs
X=data2[['Marketing','Quality']]
y=data2['Sales']
Categorical variables (nominal or ordinal) take on categories as values. A regression equation can only deal with numbers. Therefore, we have to convert these variables into indicator variables.
Example: Variable "eye color" that can take on the categories blue, brown, green.
However, we cannot use all 3 indicator variables because any two of them always predict the third perfectly. This dependence would cause the regression model to break down. Hence, we have to drop one of the categories (which one does not matter). The dropped category serves as the reference category: all effects are measured with respect to this category. Dropping, e.g., blue, yields:
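In pandas, the eye-color example looks like this (a sketch using `pd.get_dummies`; the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({'eye color': ['blue', 'brown', 'green', 'brown', 'blue']})

# one indicator column per category ...
dummies = pd.get_dummies(df, columns=['eye color'])

# ... then drop one category (here: blue) to avoid perfect collinearity;
# blue becomes the reference category
dummies = dummies.drop('eye color_blue', axis=1)
print(dummies.columns.tolist())   # ['eye color_brown', 'eye color_green']
```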
In section 2.1.5, we explain how dummy encoding is done in Python.
We will explain these steps with the help of the following example: We would like to examine how marketing expenses affect sales while controlling for the reputation of a company.
Step 1. Theoretical regression equation:
$$Sales=\beta_0+\beta_1\cdot Marketing+\beta_2\cdot Reputation$$
Slice data and remove NaN
data_reg=data2[['Marketing','Sales','Reputation']]
data_reg=data_reg.dropna()
Separate data in dependent and independent
X,y=data_reg.drop('Sales',axis=1),data_reg['Sales']
Dummy encoding
Reputation takes on the categories low, medium, high. Hence, we have to dummy encode the matrix of the independents.
Define encoder:
enc=dataprep.onehot(cats=['Reputation'],drop=None)
Fit encoder to data and transform data:
enc.fit(X)
X_dum=enc.transform(X)
We decide to drop category low. Which category we drop does not matter as mentioned earlier.
X_dum=X_dum.drop('Reputation_Low',axis=1)
X_dum.head(3)
| Reputation_High | Reputation_Medium | Marketing | |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 500.0 |
| 1 | 1.0 | 0.0 | 876.0 |
| 2 | 0.0 | 1.0 | 759.0 |
Assumptions Linear Regression
L1 and L2 refer to the variables. The remaining assumptions refer to errors. In order to test these assumptions, we fit the regression model.
reg=regression(X_dum,y)
L1. Linearity
reg.datafit.round(4)
| dv | dof resid | dof model | R2 | adj. R2 | omnibus (F) | omnibus (p-val) | LL | |
|---|---|---|---|---|---|---|---|---|
| linear reg. fit | Sales | 29.0 | 3.0 | 0.7077 | 0.6775 | 23.4053 | 0.0 | -266.3585 |
Degrees of freedom:
R2: $$R^2=\frac{Variation\: explained\: by\: the\: model}{Total\:variation}=\frac{(\hat{y}_1-mean\: y)^2+\dots+(\hat{y}_n-mean\: y)^2}{(y_1-mean\: y)^2+\dots+(y_n-mean\: y)^2}$$
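The definition above can be verified on a tiny made-up example (hypothetical numbers, plain numpy):

```python
import numpy as np

# Hypothetical observed values and model predictions
y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.5, 4.5, 7.5, 8.5])

explained = np.sum((y_hat - y.mean())**2)  # variation explained by the model
total     = np.sum((y - y.mean())**2)      # total variation
r2 = explained / total
print(r2)  # 0.85
```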
Omnibus:
The omnibus test tests the null hypothesis that no independent variable affects the dependent variable against the alternative that at least one independent variable does.
ass=reg.asstest
ass
| test | statistic | p-val | |
|---|---|---|---|
| linear reg. | Jarque-Bera | 0.5845 | 0.7466 |
| assumptions | Breusch-Pagan | 15.5276 | 0.0014 |
| Durbin-Watson | 2.1078 | ||
| Ramsey RESET | 0.2877 | 0.8832 |
The Ramsey RESET test checks for non-linear dependencies between the dependent and the independents. It tests the null hypothesis that the linear model is correctly specified against the alternative that a non-linear model has higher explanatory power.
The p-value is 88.32% > 5%. We cannot reject the null hypothesis and found no evidence for relevant non-linear dependencies.
L2. No (Multi)collinearity
What is (multi)collinearity? -> will be answered in the lecture.
Collinearity: Correlation Matrix
describe.corrmat(X_dum,stars=True,utri=False).table
| Reputation_High | Reputation_Medium | Marketing | |
|---|---|---|---|
| Reputation_High | ** | ||
| Reputation_Medium | -0.559** | ||
| Marketing | 0.2885 | 0.1539 |
An indicator for collinearity (pairwise linearity) is a correlation coefficient greater than 0.7 or less than -0.7. This is not the case.
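The pairwise check can be done with plain pandas (illustrative numbers; in our case the course module produced the matrix above):

```python
import pandas as pd

# Hypothetical predictors (illustrative numbers)
X = pd.DataFrame({'a': [1, 2, 3, 4],
                  'b': [2, 4, 6, 8],   # perfectly collinear with a
                  'c': [1, 0, 1, 0]})  # essentially unrelated to a

corr = X.corr()
# Pairs with |r| > 0.7 are collinearity candidates
print(corr.round(3))
```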
Multicollinearity: Variance inflation factors (VIF)
reg.vif
| var | vif | |
|---|---|---|
| variance inflation | intercept | 5.040589 |
| factors | Reputation_High | 1.838889 |
| Reputation_Medium | 1.726673 | |
| Marketing | 1.294894 |
Multicollinearity (three or more independents having a strong linear relationship) inflates the variance of the coefficient estimates, so small changes in the data can produce large changes in the estimates. The VIFs measure the magnitude of this variance inflation. If they are below 10, we may assume that there is no multicollinearity.
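The VIF can also be computed by hand: regress each independent on the remaining ones and set $VIF_j = 1/(1-R_j^2)$. A minimal numpy sketch on synthetic data (not the course module):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j of X on all remaining columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # unrelated
print(vif(np.column_stack([x1, x2, x3])))  # first two large, third near 1
```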
L3. Exogeneity
Exogeneity is established through theoretical or qualitative arguments in observational studies, not by statistical tests. In randomized controlled trials, the treatment would be exogenous by design (if the trial was executed properly and subjects complied with the design, etc.).
L4. Homoscedasticity
What is homoscedasticity?
pred=reg.pred
resid=reg.resid
plt.scatter(pred,resid,color='blue')
plt.xlabel('Predicted')
plt.ylabel('Residual')
ass
| test | statistic | p-val | |
|---|---|---|---|
| linear reg. | Jarque-Bera | 0.5845 | 0.7466 |
| assumptions | Breusch-Pagan | 15.5276 | 0.0014 |
| Durbin-Watson | 2.1078 | ||
| Ramsey RESET | 0.2877 | 0.8832 |
The Breusch-Pagan test tests the null hypothesis that the residuals have constant variance against the alternative that this is not the case.
The p-value is 0.14% < 5%. Therefore, we found evidence that assumption L4 is violated. This is supported by the scatterplot above.
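The idea behind the Breusch-Pagan test can be sketched by hand: regress the squared residuals on the independents and compute the LM statistic $n \cdot R^2$, which under the null is (asymptotically) $\chi^2$ distributed with as many degrees of freedom as there are independents. A toy numpy example with made-up residuals whose spread grows with $x$:

```python
import numpy as np

# Made-up residuals whose magnitude grows with x (heteroscedastic pattern)
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
e = np.array([1.0, -1.0, 2.0, -2.0, 3.0, -3.0])

# Regress squared residuals on x (with intercept) and compute R²
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, e**2, rcond=None)
resid = e**2 - A @ beta
r2 = 1.0 - resid @ resid / np.sum((e**2 - (e**2).mean())**2)

lm = len(x) * r2   # LM statistic, chi²(1) under homoscedasticity
print(lm)          # ≈ 5.88 > 3.84 (5% critical value) → reject the null
```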
L5. No Autocorrelation
ass
| test | statistic | p-val | |
|---|---|---|---|
| linear reg. | Jarque-Bera | 0.5845 | 0.7466 |
| assumptions | Breusch-Pagan | 15.5276 | 0.0014 |
| Durbin-Watson | 2.1078 | ||
| Ramsey RESET | 0.2877 | 0.8832 |
The Durbin-Watson test statistic tells us whether the errors might be autocorrelated. If the DW statistic is between 1.5 and 2.5, we may assume that there is no autocorrelation. This is the case here.
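The DW statistic itself is simple to compute: the sum of squared successive residual differences divided by the sum of squared residuals. A toy example with made-up, strongly alternating residuals, which pushes DW towards 4 (negative autocorrelation):

```python
import numpy as np

def durbin_watson(resid):
    """DW = Σ (e_t - e_{t-1})² / Σ e_t²; values near 2 indicate no
    autocorrelation, towards 0 positive, towards 4 negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid)**2) / np.sum(resid**2)

e = np.array([1.0, -1.0, 1.0, -1.0])  # alternating signs
print(durbin_watson(e))  # 3.0, well above the 1.5–2.5 band
```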
L6. Normality
plots.dist(resid,fig=[6,4],labelsize=14,ticksize=12,linewidth=1)
plots.qq(resid,fig=[6,4],labelsize=14,ticksize=12)
ass
| test | statistic | p-val | |
|---|---|---|---|
| linear reg. | Jarque-Bera | 0.5845 | 0.7466 |
| assumptions | Breusch-Pagan | 15.5276 | 0.0014 |
| Durbin-Watson | 2.1078 | ||
| Ramsey RESET | 0.2877 | 0.8832 |
The Jarque-Bera test tests the null hypothesis that the errors are normally distributed against the alternative that they are not.
The p-value is 74.66% > 5%. We thus found no evidence that this assumption is violated.
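For reference, the Jarque-Bera statistic combines sample skewness $S$ and kurtosis $K$: $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$. A small numpy sketch on made-up data:

```python
import numpy as np

def jarque_bera(x):
    """JB = n/6 * (S² + (K - 3)² / 4) with moment-based skewness S and
    kurtosis K; large values indicate departure from normality."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    S = m3 / m2**1.5
    K = m4 / m2**2
    return n / 6.0 * (S**2 + (K - 3.0)**2 / 4.0)

# Symmetric toy sample: skewness is exactly 0, kurtosis 1.7
print(jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0]))  # ≈ 0.352
```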
reg.coef.round(4)
| coef | stand. coef | std err | t | P>|t| | [0.025 | 0.975] | ||
|---|---|---|---|---|---|---|---|---|
| linear reg. | intercept | 1138.7134 | -0.0334 | 322.998 | 3.525 | 0.001 | 478.109 | 1799.318 |
| coefficients | Reputation_High | 2456.2189 | 1.4808 | 438.050 | 5.607 | 0.000 | 1560.306 | 3352.132 |
| Reputation_Medium | 1073.9194 | -0.1078 | 379.661 | 2.829 | 0.008 | 297.425 | 1850.414 | |
| Marketing | 2.1386 | -1.3396 | 0.719 | 2.975 | 0.006 | 0.668 | 3.609 |
Estimated regression equation and coefficient interpretation
Significance
Column $P>|t|$ shows the p-values of the t-tests for each coefficient of the independents with hypotheses:
Effects
High has the strongest influence on sales (its standardized coefficient is farthest from zero); medium has the least influence.
Robustness
The standard error (std err) shows the estimated standard deviation of the coefficient estimates.
Outlier
plots.outlier(X_dum,y,dtype='multivariate',method='Cook')
out=outlier.multivariate(X_dum,y).show()
X_dum.loc[out]
| Reputation_High | Reputation_Medium | Marketing | |
|---|---|---|---|
| 1 | 1.0 | 0.0 | 876.0 |
| 16 | 1.0 | 0.0 | 150.0 |
| 19 | 1.0 | 0.0 | 332.0 |
| 22 | 1.0 | 0.0 | 900.0 |
| 29 | 1.0 | 0.0 | 300.0 |
y.loc[out]
1     6489.526396
16    5003.382446
19    6068.459539
22    4061.039692
29    3000.000000
Name: Sales, dtype: float64
There seem to be no measurement errors. We check whether removing the influential observations leads to homoscedasticity:
X_dum2, y2 = X_dum.drop(out), y.drop(out)
reg2=regression(X_dum2,y2)
reg2.asstest
| test | statistic | p-val | |
|---|---|---|---|
| linear reg. | Jarque-Bera | 1.6110 | 0.4469 |
| assumptions | Breusch-Pagan | 5.6365 | 0.1307 |
| Durbin-Watson | 1.8189 | ||
| Ramsey RESET | 0.9968 | 0.4323 |
Apparently, removing the influential observations led to homoscedasticity (p-value of Breusch-Pagan > 5%).
reg2.coef.round(4)
| coef | stand. coef | std err | t | P>|t| | [0.025 | 0.975] | ||
|---|---|---|---|---|---|---|---|---|
| linear reg. | intercept | 955.9213 | 0.0040 | 271.394 | 3.522 | 0.002 | 395.792 | 1516.051 |
| coefficients | Reputation_High | 1943.3811 | 1.4423 | 435.058 | 4.467 | 0.000 | 1045.465 | 2841.297 |
| Reputation_Medium | 910.5409 | -0.0621 | 309.514 | 2.942 | 0.007 | 271.736 | 1549.346 | |
| Marketing | 2.9179 | -1.3842 | 0.707 | 4.126 | 0.000 | 1.458 | 4.378 |
All 3 independents are still significant. However, the effect of high decreased because we only removed observations with reputation high.
Overfitting
Adding independents to the model always increases $R^2$. However, it is not clear whether this reflects a true causal effect or is purely mechanical: each added independent costs a degree of freedom, which may inflate $R^2$ artificially. For instance, if there are no residual degrees of freedom left (the number of estimated parameters equals the number of observations n), we always end up with $R^2 = 1$, purely for mechanical, mathematical reasons. Such an artificial increase in explanatory power is known as overfitting: the model is fitted too closely to the data.
A possibility to detect overfitting is the adjusted $R^2_{adj}$. In contrast to $R^2$, it takes the degrees of freedom into account and may decrease when we include additional independent variables:
reg2.datafit
| dv | dof resid | dof model | R2 | adj. R2 | omnibus (F) | omnibus (p-val) | LL | |
|---|---|---|---|---|---|---|---|---|
| linear reg. fit | Sales | 24.0 | 3.0 | 0.758737 | 0.728579 | 25.158814 | 1.381691e-07 | -218.683211 |
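The adjusted value in the table can be reproduced from the formula $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, where $n$ is the number of observations and $k$ the number of independents ($n - k - 1$ residual degrees of freedom):

```python
# Values from the fit above: R² = 0.758737, k = 3 independents,
# 24 residual degrees of freedom → n = 24 + 3 + 1 = 28 observations
r2, n, k = 0.758737, 28, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 6))  # 0.728579, matching the table
```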
Data fit without variable "reputation":
X_dum3=X_dum2.drop(['Reputation_High','Reputation_Medium'],axis=1)
reg3=regression(X_dum3,y2)
reg3.datafit
| dv | dof resid | dof model | R2 | adj. R2 | omnibus (F) | omnibus (p-val) | LL | |
|---|---|---|---|---|---|---|---|---|
| linear reg. fit | Sales | 26.0 | 1.0 | 0.555364 | 0.538263 | 32.474838 | 0.000005 | -227.242349 |
Conclusion:
The adjusted $R^2$ of the model without Reputation is less than that of the model with Reputation. Hence, there is no indication that our model is overfitted.
Moderation Effect
Moderator = third variable that determines the magnitude of the effect.
Marketing might have a weaker effect on Sales if the company has less reputation:
Let's have a look at the groupwise regressions:
data_reg
data_reg=data_reg.dropna()
sns.lmplot(data=data_reg, x="Marketing", y="Sales", hue="Reputation",ci=None)
The slopes for reputation 'low' and 'high' seem fairly similar. The slope for 'medium', however, is steeper.
In order to test whether there is indeed a moderation, we need to include the interactions $marketing\times medium$ and $marketing\times high$ in the model. If these interactions are significant, we have found evidence for a moderation effect.
X_mod = X_dum2.copy()
X_mod['marketing_medium']=X_mod['Marketing']*X_mod['Reputation_Medium']
X_mod['marketing_high']=X_mod['Marketing']*X_mod['Reputation_High']
reg_mod=regression(X_mod,y2)
reg_mod.coef.round(4)
| coef | stand. coef | std err | t | P>|t| | [0.025 | 0.975] | ||
|---|---|---|---|---|---|---|---|---|
| linear reg. | intercept | 1283.6331 | 1.9366 | 384.307 | 3.340 | 0.003 | 486.629 | 2080.637 |
| coefficients | Reputation_High | 600.2960 | 0.4653 | 1487.051 | 0.404 | 0.690 | -2483.658 | 3684.250 |
| Reputation_Medium | 414.4227 | 0.0651 | 572.115 | 0.724 | 0.476 | -772.071 | 1600.916 | |
| Marketing | 1.5208 | -0.8239 | 1.356 | 1.122 | 0.274 | -1.291 | 4.332 | |
| marketing_medium | 1.7763 | -0.8233 | 1.615 | 1.100 | 0.283 | -1.572 | 5.125 | |
| marketing_high | 3.3688 | -0.8199 | 3.037 | 1.109 | 0.279 | -2.930 | 9.667 |
The p-values of both interaction terms are greater than 5%. Hence, we found no evidence for a moderation effect.
Omitted Variable Bias (OVB)
The results of a regression are biased if a relevant independent variable is omitted. A variable is relevant if the following conditions are met:
Our data set contains another variable that might be relevant: product quality measured by the average lifespan of products.
What would happen if quality were relevant? For the sake of simplicity, we omit reputation in the following:
Hence, instead of estimating the true effect of marketing on sales ($\beta_1$), we estimated $\gamma_1$, which might be higher or lower than $\beta_1$ depending on the direction of the correlation between marketing and quality.
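The bias can be illustrated with a small simulation (synthetic data, plain numpy, not our survey data): when a relevant variable (quality) correlated with marketing is omitted, the estimated marketing coefficient absorbs part of quality's effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
marketing = rng.normal(size=n)
quality = 0.8 * marketing + rng.normal(scale=0.5, size=n)        # correlated with marketing
sales = 2.0 * marketing + 3.0 * quality + rng.normal(size=n)     # true model

def ols_slopes(X, y):
    """OLS fit with intercept; returns the slope coefficients only."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

full = ols_slopes(np.column_stack([marketing, quality]), sales)  # ≈ [2, 3]
short = ols_slopes(marketing.reshape(-1, 1), sales)              # ≈ 2 + 3·0.8 = 4.4, biased
print(full.round(2), short.round(2))
```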
Let's check if quality leads to an omitted variable bias:
Condition 1: Is quality correlated with another independent? $\rightarrow$ correlation matrix.
X_ovb=X_dum.copy()
X_ovb['Quality']=data2['Quality']
describe.corrmat(X_ovb,stars=True,utri=False).table
| Reputation_High | Reputation_Medium | Marketing | Quality | |
|---|---|---|---|---|
| Reputation_High | ** | ** | ||
| Reputation_Medium | -0.559** | |||
| Marketing | 0.2885 | 0.1539 | **** | |
| Quality | 0.5841** | -0.0625 | 0.6951**** |
Quality is highly correlated with marketing.
Condition 2: Does quality affect sales? $\rightarrow$ regression
reg_ovb=regression(X_ovb,y)
reg_ovb.coef
| coef | stand. coef | std err | t | P>|t| | [0.025 | 0.975] | ||
|---|---|---|---|---|---|---|---|---|
| linear reg. | intercept | 882.1838 | 0.201650 | 294.417 | 2.996 | 0.006 | 279.097 | 1485.270 |
| coefficients | Reputation_High | 1681.7791 | 1.621210 | 456.236 | 3.686 | 0.001 | 747.223 | 2616.335 |
| Reputation_Medium | 880.3159 | 0.198334 | 338.118 | 2.604 | 0.015 | 187.713 | 1572.919 | |
| Marketing | 0.5870 | -1.363490 | 0.801 | 0.733 | 0.470 | -1.053 | 2.227 | |
| Quality | 398.1351 | -0.657704 | 127.010 | 3.135 | 0.004 | 137.966 | 658.304 |
The p-value of quality is lower than 5%. Together with condition 1, this suggests an omitted variable bias. Notice that marketing is no longer significant when we control for quality.
Mediation Effect
Mediator = third variable through which an effect occurs
Remark: This example is admittedly somewhat artificial, since it is not clear why marketing should cause higher quality. However, it is preferable to introducing a completely new data set.
Test if quality is a mediator:
data_med=X_ovb
data_med['Sales']=y
pg.mediation_analysis(data=data_med, x='Marketing', m='Quality',covar=['Reputation_High', 'Reputation_Medium'], y='Sales')
| path | coef | se | pval | CI[2.5%] | CI[97.5%] | sig | |
|---|---|---|---|---|---|---|---|
| 0 | Quality ~ X | 0.003897 | 0.000920 | 0.000211 | 0.002015 | 0.005779 | Yes |
| 1 | Y ~ Quality | 455.685659 | 99.042733 | 0.000077 | 253.120524 | 658.250793 | Yes |
| 2 | Total | 2.138628 | 0.718943 | 0.005857 | 0.668225 | 3.609031 | Yes |
| 3 | Direct | 0.587043 | 0.800796 | 0.469607 | -1.053314 | 2.227400 | No |
| 4 | Indirect | 1.551585 | 0.575687 | 0.004000 | 0.712026 | 3.237173 | Yes |
The indirect effect (also referred to as average causal mediation effect or ACME) of marketing on sales through mediator quality quantifies the estimated difference in sales resulting from a one-unit change in marketing through a sequence of causal steps in which marketing affects quality, which in turn affects sales. It is considered significant if the specified confidence interval does not include 0.
The indirect effect is highly significant, while the direct is not. This suggests a mediation effect.
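A useful sanity check on the table: the total effect decomposes into direct plus indirect effect. With the coefficients reported above:

```python
# Coefficients taken from the mediation table above
total, direct, indirect = 2.138628, 0.587043, 1.551585

# Total effect = direct effect + indirect (mediated) effect
print(abs(total - (direct + indirect)) < 1e-6)  # True
```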
The linear regression model only works properly when the dependent variable is quantitative. For categorical dependent variables, it is advisable to use another model. The following table summarizes the suggested models.
Example: Logistic Regression
Logistic regression deals with binary dependent variables. It predicts the probability that the dependent variable takes on each of its two categories. For instance, we may predict gender based on height.
data_log=data[['Gender','Height']]
data_log=data_log.dropna()
data_log=data_log[data_log['Height']>70]
Visualize Logistic Regression:
plots.scatter(data_log['Height'],data_log['Gender'],regression='logistic',intext=True,pos=[5,0.5])
Results Logistic Regression:
reg_log=regression(pd.DataFrame(data_log['Height']),data_log['Gender'],method='logistic')
Optimization terminated successfully.
Current function value: 0.322937
Iterations 8
reg_log.coef.round(4)
| coef | exp(coef) | std err | z | P>|z| | [0.025 | 0.975] | ||
|---|---|---|---|---|---|---|---|---|
| logistic reg. | intercept | -56.9686 | 0.0000 | 3.541 | -16.090 | 0.0 | -63.908 | -50.029 |
| coefficients | Height | 0.3243 | 1.3831 | 0.020 | 16.016 | 0.0 | 0.285 | 0.364 |
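Using the fitted coefficients, predicted probabilities follow from the logistic function $p = 1/(1+e^{-(\beta_0+\beta_1\cdot height)})$. A quick sketch (note: which gender category is coded as 1 depends on how the data were encoded, so the probability below refers to that coding):

```python
import math

# Fitted coefficients from the output above
b0, b1 = -56.9686, 0.3243

def predict_prob(height):
    """Predicted probability that the dependent variable equals 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * height)))

# The probability increases monotonically with height
print(round(predict_prob(170), 3), round(predict_prob(180), 3))
```

The exp(coef) column is the odds ratio: each additional centimeter of height multiplies the odds by exp(0.3243) ≈ 1.38.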